Abstract-Audio fingerprinting, also known as audio hashing,
is well known as a powerful technique for audio identification
and synchronization. It involves two major steps: fingerprint
(voice pattern) design and matching search. While the first step
concerns the derivation of a robust and compact audio signature,
the second step usually requires knowledge about databases and
quick-search algorithms. Though this technique offers a wide
range of real-world applications, to the best of the authors'
knowledge, the last comprehensive survey of existing algorithms
appeared more than eight years ago. Thus, in this paper, we
present a more up-to-date review and, to emphasize the audio
signal processing aspect, we focus our state-of-the-art survey
on the fingerprint design step, for which various audio features
and their tractable statistical models are discussed.

Keywords-Voice pattern; audio identification and synchronization;
spectral features; statistical models.

I. Introduction 

Real-time interactive applications have emerged in recent years
thanks to the increased power of mobile devices and their
Internet access speed. Consider applications such as music
recognition [?], where people hear a song in a public place
and want to know more about it, or personalized TV
entertainment [?], where people want to see additional services and
related content on the Web alongside the main view on
TV. Both require a fast and reliable audio identification system
to match the observed audio signal with its origin
stored in a large database. For these purposes, several research
directions have been studied, such as audio fingerprinting [?],
audio watermarking [?], and timeline insertion [?]. While the
watermarking and timeline approaches both require embedding a
signature into the original media content, which is sometimes
inconvenient for the considered applications, the fingerprinting
technique allows the data to be monitored directly for identification.
Hence, audio fingerprinting has been widely investigated in the
literature and has already been deployed in many commercial
products [?]. This technique has recently been
exploited for other applications such as media content
synchronization [?], multiple video clustering [?], repeating
object detection [15], and live version identification [15].

A general architecture for an audio fingerprinting system,
which can be used for either audio identification or audio
synchronization, is depicted in Fig. 1. The fingerprint
extraction derives a set of relevant audio features, followed by
optional post-processing and feature modeling. Fingerprints
of the original audio collection and the corresponding metadata
(e.g., audio ID, name, time frame index, etc.) are systematically
stored in a database. Then, given a short recording from the
user side, its feature vectors (i.e., fingerprints) are computed in
the same way as for the original data. Finally, a
search algorithm finds the best match between these
fingerprints and those stored in the database, so that the
recorded audio signal is labeled with the matched metadata.


[Block diagram: original audio signals (content database) -> fingerprint extraction -> fingerprints + metadata -> fingerprint database.]

Figure 1: General architecture of an audio fingerprinting system.


In real-world recordings, the audio signal often undergoes
many kinds of distortion: acoustic reverberation, background
noise addition, quantization error, etc. Thus, the derived
fingerprints must be robust with respect to these various signal
degradations. Besides, the fingerprint size should be as small
as possible to save memory resources and to allow real-time
matching. The general properties of audio
fingerprints are discussed in detail in [1], [16], [17]. In order to fulfil
those requirements, the sampled audio signal is often transformed
into the Time-Frequency (T-F) domain via the Short Time Fourier
Transform (STFT) [16], where numerous distinguishing
characteristics are exploited, such as high-level musical attributes
(e.g., predominant pitch, harmonic structure) or low-level spectral
features (e.g., mel-frequency cepstrum, spectral centroids, spectral
note onsets). To further compact the fingerprints,
some approaches fit the spectral feature vectors
to a statistical model, e.g., a Gaussian Mixture Model (GMM)
[18] or a Hidden Markov Model (HMM) [19], so that in the end
only the set of model parameters is used as the fingerprint.














Though diverse fingerprinting algorithms have been proposed
in the literature, the number of review papers remains
limited: to the best of the authors' knowledge, a comprehensive
review of fingerprinting algorithms was presented
more than eight years ago [?], and a more recent survey
[20] focuses only on computer vision based approaches (e.g.,
the methods proposed in [21], [22]). In this paper, we present a
more up-to-date review of the domain, with particular focus
on the fingerprint extraction block in Fig. 1, where
various audio spectral features and their statistical models are
summarized systematically. The presentation should particularly
benefit new researchers in the domain and engineers,
in the sense that they can easily follow the described steps to
implement different audio fingerprints.

The rest of the paper is structured as follows. We first
present a general architecture for fingerprint design in Section
II. We then review various audio features, which have been
extensively exploited in the literature, in Section III. Some
statistical feature models are detailed in Section IV.
Finally, we conclude in Section V.

II. General architecture of fingerprint design 

Fig. 2 depicts a general workflow of the fingerprint design.
The purpose of each block is summarized as follows:

Figure 2: General workflow of the fingerprint design.


• Pre-processing: in this step, the input audio signal is
often first digitized (if necessary), re-sampled to
a target sampling rate, and bandpass filtered. Other
types of processing include decorrelation and amplitude
normalization [?]. The processed signal
is then segmented into overlapping time frames, and a
linear transformation, e.g., Fast Fourier Transform
(FFT), Discrete Cosine Transform (DCT), or wavelet
transform [16], is applied to each frame. At this stage,
the input time-domain signal is represented in a feature
domain; the most popular feature domain is the
time-frequency representation given by the STFT.


• Feature extraction: this is a major step, since the
choice of which feature is used directly affects
the performance of the entire fingerprinting system.
A great diversity of features has been investigated,
targeting both the reduction of dimensionality and
invariance to various distortions. In summary,
most approaches first map the linear time-frequency
representation given by the STFT to an auditory-motivated
frequency scale, i.e., Mel, Bark, Log, or
Cent scale, via filterbanks [?]. This mapping step
greatly reduces the spectrogram size, since the number
of filterbank channels is usually much smaller than the FFT
length. Then a feature vector, such as Mel-Frequency
Cepstral Coefficients (MFCC) or the spectral centroids of
all subbands, is computed for each time frame.
In some systems, the first and second derivatives
of the feature vectors are also integrated to better
track the temporal variation of audio signals [?].
Other types of features worth mentioning are, e.g.,
time-localized frequency peaks [?], time-frequency
energy peak locations [?], and those developed in image
processing based approaches, such as top-wavelet
coefficients computed on the spectral image [22] and
multiscale Gabor atoms extracted by the Matching Pursuit
algorithm [?]. Recently, a general framework for
dictionary based feature learning has also been introduced
[23].

• Post-processing: the feature vectors computed in the
previous step are often real-valued, and their absolute
range depends on the signal power. Therefore, when
the Euclidean distance is used in the matching step, mean
subtraction and component-wise variance normalization
are recommended [26], [27]. Another popular
post-processing step is quantization, where each entry of the
feature vectors is quantized to a binary number in
order to gain robustness against distortions and, more
importantly, to obtain memory efficiency [?],
[15]. In many existing systems, the fingerprint is obtained
after this step.

• Feature modeling: this block is sometimes deployed
in order to further compact the fingerprint. In this
case, a large number of feature vectors along time
frames is fitted to a statistical model, so that an
input audio signal is well characterized by the model
parameters, which are then stored as a fingerprint
[?]. Popular models include the Gaussian Mixture
Model (GMM) and the Hidden Markov Model (HMM).
Other approaches use decomposition techniques, e.g.,
Non-negative Matrix Factorization (NMF), to
decrease the data dimension and thereby reduce
the local statistical redundancy of the feature vectors
[?], [32].
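
As a rough illustration of the pre-processing and post-processing blocks listed above, the following Python/NumPy sketch frames a signal, computes an STFT power spectrogram, normalizes a feature matrix, and quantizes it to bits. The function names, the Hann window, the frame length and hop size, and the rule "bit = 1 when the normalized value is positive" are assumptions made for illustration; they are not taken from any particular system reviewed here.

```python
import numpy as np

def stft_power(x, frame_len=512, hop=256):
    """Power spectrogram |s(n, f)|^2 of shape (frames, frame_len // 2 + 1)."""
    window = np.hanning(frame_len)  # assumed window; others are equally valid
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2

def normalize(features):
    """Mean subtraction and component-wise variance normalization."""
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    return (features - mean) / np.maximum(std, 1e-12)

def binarize(features):
    """One bit per entry: 1 where the normalized feature is positive (assumed rule)."""
    return (features > 0).astype(np.uint8)
```

In such a pipeline, a feature extraction step (Section III) would typically sit between `stft_power` and `normalize`; here the normalization can be applied to any (frames, d) feature matrix.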

Since the pre-processing and post-processing steps are
quite straightforward, in the remainder of the paper we
present more detail only on the feature extraction and
feature modeling blocks.

III. Feature extraction 

Summarizing the numerous types of audio features used for
fingerprint design so far would certainly go beyond the scope of
this paper. Thus, in this section, we choose to present only the most
popular low-level features in the spectral domain.

A. MFCC 

MFCC is one of the most popular features in
speech recognition: the amplitude spectrum of the input
audio signal is first weighted by triangular filters spaced
according to the Mel scale, and the DCT is then applied to
decorrelate the Mel-spectral vectors. MFCC was also shown to
be applicable to music signals in [33]. Examples of
fingerprinting algorithms using MFCC features are found in
[33], [18]. In [34], MFCC was also used for clustering and
synchronizing large-scale audio-video sequences recorded by
multiple users during an event. Matlab implementations for the
computation of MFCC are available [35], [36].
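
The MFCC computation described above can be sketched in Python/NumPy as follows. This is a minimal illustration, not a port of the Matlab implementations cited: the Mel formula, the number of filters and coefficients, and the logarithm taken before the DCT (standard in MFCC pipelines but not stated explicitly above) are assumptions.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose edges are equally spaced on the Mel scale."""
    n_bins = n_fft // 2 + 1
    mel_edges = np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_filters + 2)
    hz_edges = mel_to_hz(mel_edges)
    freqs = np.linspace(0.0, sample_rate / 2.0, n_bins)
    bank = np.zeros((n_filters, n_bins))
    for b in range(n_filters):
        lo, mid, hi = hz_edges[b], hz_edges[b + 1], hz_edges[b + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        bank[b] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def mfcc(amplitude_spec, sample_rate, n_filters=26, n_coeffs=13):
    """amplitude_spec: (frames, n_fft // 2 + 1) magnitudes -> (frames, n_coeffs)."""
    n_fft = 2 * (amplitude_spec.shape[1] - 1)
    bank = mel_filterbank(n_filters, n_fft, sample_rate)
    # Log compression of the Mel-weighted spectrum (assumed, see lead-in).
    log_mel = np.log(amplitude_spec @ bank.T + 1e-10)
    # Unnormalized DCT-II basis, written out to decorrelate the Mel vectors.
    k = np.arange(n_coeffs)[:, None]
    m = np.arange(n_filters)[None, :]
    dct = np.cos(np.pi * k * (2 * m + 1) / (2 * n_filters))
    return log_mel @ dct.T
```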

B. Spectral Energy Peak (SEP) 

SEP for music identification systems was described in
[?], where a time-frequency point is considered a peak
if it has a higher amplitude than its neighboring points. SEP is
argued to be intrinsically robust even to high-level background
noise and can provide discrimination in sound mixtures [38]. In
the well-known Shazam system [1], the time-frequency coordinates
of the energy peaks were described as sparse landmark points.
Then, by using pairs of landmark points rather than single
points, the fingerprints exploited the spectral structure of sound
sources. This landmark feature can also be found in [?]
and [39] for multiple video clustering. Ramona et al. used
the start times of the spectral energy peaks, referred to as onsets,
for the automatic alignment of audio occurrences in their
fingerprinting system [23], [40].
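
To make the peak and landmark-pair idea concrete, the sketch below marks a time-frequency point as a peak when it exceeds all eight of its neighbours, then pairs each peak with later peaks inside a bounded "target zone". The 8-neighbourhood, the threshold, the zone sizes `max_dt` and `max_df`, and the tuple layout are hypothetical choices for illustration and are not the parameters of any deployed system.

```python
import numpy as np

def spectral_peaks(power, threshold=0.0):
    """Boolean mask of points strictly greater than their 8 neighbours.

    power: float array of shape (frames, bins).
    """
    padded = np.pad(power, 1, mode="constant", constant_values=-np.inf)
    is_peak = power > threshold
    rows, cols = power.shape
    for dn in (-1, 0, 1):
        for df in (-1, 0, 1):
            if dn == 0 and df == 0:
                continue
            neighbour = padded[1 + dn:1 + dn + rows, 1 + df:1 + df + cols]
            is_peak &= power > neighbour
    return is_peak

def landmark_pairs(is_peak, max_dt=32, max_df=64):
    """Pair each peak with later peaks nearby; returns (f1, f2, dt, n1) tuples."""
    n_idx, f_idx = np.nonzero(is_peak)  # row-major order, so sorted by frame
    pairs = []
    for i in range(len(n_idx)):
        for j in range(i + 1, len(n_idx)):
            dt = n_idx[j] - n_idx[i]
            if dt > max_dt:
                break  # later peaks are even further away in time
            if dt > 0 and abs(f_idx[j] - f_idx[i]) <= max_df:
                pairs.append((int(f_idx[i]), int(f_idx[j]), int(dt), int(n_idx[i])))
    return pairs
```

Each tuple combines two peak frequencies and their time offset, which is what makes a pair more distinctive than a single peak; the anchor time `n1` would be stored alongside as metadata.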

C. Spectral Band Energy (SBE) 

Together with the spectral peak, SBE has been widely
exploited in fingerprinting algorithms. Let us denote by $s(n, f)$ an
STFT coefficient of an audio signal at time frame index $n$ and
frequency bin index $f$, $1 \le f \le M$. Let us also denote by $b$
an auditory-motivated subband index, i.e., in either Mel, Bark,
Log, or Cent scale, and by $l_b$ and $h_b$ the lower and upper edges
of the $b$-th subband. SBE is then computed, with normalization,
in each time frame and each frequency subband range by

$$F^{\mathrm{SBE}}_{n,b} = \frac{\sum_{f=l_b}^{h_b} |s(n,f)|^2}{\sum_{f=1}^{M} |s(n,f)|^2}. \tag{1}$$

Haitsma et al. proposed a famous fingerprint in [2], where
SBEs were first computed in a block containing 257 time
frames and 33 Bark-scale frequency subbands, then each $F^{\mathrm{SBE}}_{n,b}$
was quantized to a binary value (either 0 or 1) based on its
differences compared to neighboring points. Other fingerprinting
algorithms exploiting the SBE feature can be found, for instance, in
[?]. Variants of this subband energy difference feature
can be found in more recent approaches [28], [?].
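
A possible reading of the SBE feature and its binary quantization is sketched below. The normalized energy follows Eq. (1); the bit rule (sign of the energy difference taken first across adjacent subbands and then across adjacent frames) is the commonly described form of this kind of fingerprint and is stated here as an assumption, since the text above only says that bits are derived from differences with neighboring points. Band edges are passed in by the caller; no Bark-scale edge table is implemented.

```python
import numpy as np

def subband_energy(power, edges):
    """Eq. (1): energy in each band [l_b, h_b], normalized by the frame energy.

    power: (frames, M) array of |s(n, f)|^2.
    edges: list of (l_b, h_b) bin index pairs, inclusive and 0-based here.
    """
    total = power.sum(axis=1, keepdims=True)
    bands = np.stack([power[:, lo:hi + 1].sum(axis=1) for lo, hi in edges], axis=1)
    return bands / np.maximum(total, 1e-12)

def energy_difference_bits(sbe):
    """Bits from the sign of the band-to-band, frame-to-frame energy difference."""
    band_diff = sbe[:, :-1] - sbe[:, 1:]        # difference along frequency
    time_diff = band_diff[1:] - band_diff[:-1]  # then difference along time
    return (time_diff > 0).astype(np.uint8)
```

With 257 frames and 33 subbands, as in the block size quoted above, this yields a 256 x 32 bit matrix, i.e., one 32-bit word per frame transition.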

D. Spectral Flatness Measure (SFM) 

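SFM, also known as Wiener entropy, relates to the tonality
of audio signals and is therefore often used to
distinguish different recordings. SFM is computed in each
time-frequency subband point $(n, b)$ as the ratio of the geometric
mean to the arithmetic mean of the power spectrum in the subband:

$$F^{\mathrm{SFM}}_{n,b} = \frac{\left(\prod_{f=l_b}^{h_b} |s(n,f)|^2\right)^{\frac{1}{h_b - l_b + 1}}}{\frac{1}{h_b - l_b + 1}\sum_{f=l_b}^{h_b} |s(n,f)|^2}. \tag{2}$$

A high SFM indicates that the signal power is similar over
all frequencies, while a low SFM means that the signal power is
concentrated in a relatively small number of frequencies within
the subband.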

A feature similar to SFM, which also measures the
tone-like or noise-like character of an audio signal and has been
exploited as a fingerprint, is the spectral crest factor (SCF). SCF is
computed by

$$F^{\mathrm{SCF}}_{n,b} = \frac{\max_{f \in [l_b, h_b]} \left(|s(n,f)|^2\right)}{\frac{1}{h_b - l_b + 1}\sum_{f=l_b}^{h_b} |s(n,f)|^2}. \tag{3}$$


SFM and SCF were found to be the most promising
features for audio matching with common distortions in [42]
and were both considered in other fingerprinting algorithms
[?].
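
The two tonality measures can be computed side by side, as in the following sketch. It assumes the usual definition of SFM (geometric mean over arithmetic mean of the subband power) and the SCF of Eq. (3) (maximum over arithmetic mean); the small constant `eps`, added to avoid taking the logarithm of zero, is an implementation choice and not part of either definition.

```python
import numpy as np

def flatness_and_crest(power, edges, eps=1e-12):
    """Return (SFM, SCF), each of shape (frames, bands).

    power: (frames, M) array of |s(n, f)|^2.
    edges: list of (l_b, h_b) bin index pairs, inclusive and 0-based here.
    """
    sfm, scf = [], []
    for lo, hi in edges:
        band = power[:, lo:hi + 1] + eps
        arithmetic = band.mean(axis=1)
        geometric = np.exp(np.log(band).mean(axis=1))
        sfm.append(geometric / arithmetic)
        scf.append(band.max(axis=1) / arithmetic)
    return np.stack(sfm, axis=1), np.stack(scf, axis=1)
```

For a perfectly flat subband both measures equal 1; for a subband holding a single tone, SFM approaches 0 while SCF approaches the number of bins in the subband.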


E. Spectral Centroid (SC) 

SC is also a popular measure used in audio signal
processing to indicate where the "center of mass" of a subband
spectrum is. It is formulated as

$$F^{\mathrm{SC}}_{n,b} = \frac{\sum_{f=l_b}^{h_b} f\,|s(n,f)|^2}{\sum_{f=l_b}^{h_b} |s(n,f)|^2}. \tag{4}$$

SC was argued to be robust to equalization, compression,
and noise addition. It was reported in [26] and [?] that
SC-based fingerprints offered better audio recognition than
MFCC-based fingerprints with audio clips 3 to 4 seconds
long. In our preliminary experiment with speech utterances
distorted by reverberation and real-world background noise,
we also observed that SC-based fingerprints resulted in higher
recognition accuracy than MFCC-, SBE-, and SFM-based
fingerprints without post-processing.

Given one of the feature parameters $F_{n,b}$ computed in each
time-frequency subband point $(n, b)$ as described above, a
$d$-dimensional feature vector $\mathbf{F}_n = [F_{n,1}, \ldots, F_{n,d}]^T$ is formed to
describe the corresponding characteristic of the signal at time
frame $n$, where $T$ denotes vector transpose and $d$ is the total
number of subbands. When the first and second derivatives
of the feature vectors are additionally considered, to better
characterize the temporal variation of the audio signal, $\mathbf{F}_n$ will
then be a $3d$-dimensional vector [?] before being passed to the
post-processing block shown in Fig. 2.
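
The spectral centroid of Eq. (4) and the stacking of derivatives into a 3d-dimensional vector might look as follows. The use of `np.gradient` (central differences along time) for the derivatives is an assumption, as the text does not say how they are computed; returning 0 for a silent subband is likewise an illustrative convention.

```python
import numpy as np

def spectral_centroid(power, edges, eps=1e-12):
    """Eq. (4): power-weighted mean bin index in each subband.

    power: (frames, M) array of |s(n, f)|^2.
    edges: list of (l_b, h_b) bin index pairs, inclusive and 0-based here.
    A silent subband yields 0 (convention chosen for this sketch).
    """
    cols = []
    for lo, hi in edges:
        f = np.arange(lo, hi + 1)
        band = power[:, lo:hi + 1]
        cols.append((band * f).sum(axis=1) / np.maximum(band.sum(axis=1), eps))
    return np.stack(cols, axis=1)

def add_derivatives(features):
    """Stack features with first and second time differences -> (frames, 3d)."""
    delta = np.gradient(features, axis=0)   # needs at least two frames
    delta2 = np.gradient(delta, axis=0)
    return np.concatenate([features, delta, delta2], axis=1)
```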


IV. Feature modeling 

In some systems, in order to further compact the fingerprint,
the feature vectors $\mathbf{F}_n$ can be fitted to a statistical
model. This step reduces the global redundancy of the
spectral features. As a result, a long sequence of feature
vectors $\mathbf{F}_n$, $n = 1, \ldots, N$, is characterized by a significantly
smaller number of model parameters while preserving the
discriminative power. In this section we review the use of
three popular models for fingerprint design, namely the Gaussian
mixture model (GMM), the hidden Markov model (HMM), and
nonnegative matrix factorization (NMF).









A. GMM-based fingerprint

The GMM has been used to model the spectral shape of audio
signals in many different applications, ranging from speaker
identification [43] to speech enhancement [30]. It was also
investigated for audio fingerprinting by Ramalingam and Krishnan
[?], where the spectral feature vectors $\mathbf{F}_n$ are modeled as
a multidimensional $K$-state Gaussian mixture with probability
density function (pdf) given by

$$p(\mathbf{F}_n) = \sum_{k=1}^{K} \alpha_k\, \mathcal{N}_c(\mathbf{F}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \tag{5}$$

where $\alpha_k$, which satisfies $\sum_{k=1}^{K} \alpha_k = 1$, $\boldsymbol{\mu}_k$ and $\boldsymbol{\Sigma}_k$ are the
weight, the mean vector and the covariance matrix of the $k$-th
state, respectively, and

$$\mathcal{N}_c(\mathbf{F}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) = \frac{1}{|\pi \boldsymbol{\Sigma}_k|}\, e^{-(\mathbf{F}_n - \boldsymbol{\mu}_k)^H \boldsymbol{\Sigma}_k^{-1} (\mathbf{F}_n - \boldsymbol{\mu}_k)} \tag{6}$$

where $^H$ and $|\cdot|$ denote the conjugate transpose and the determinant
of a matrix, respectively. The model parameters
$\theta = \{\alpha_k, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\}_k$ are then estimated in the maximum likelihood
(ML) sense via the expectation-maximization (EM) algorithm,
which is well known as an appropriate choice in this case, with
the global log-likelihood defined as

$$\mathcal{L}_{\mathrm{ML}} = \sum_{n=1}^{N} \log p(\mathbf{F}_n \mid \theta). \tag{7}$$

As a result, the parameters are iteratively updated via two
EM steps as follows:

• E-step: compute the posterior probability that feature
vector $\mathbf{F}_n$ is generated from the $k$-th GMM state

$$\gamma_{nk} = \frac{\alpha_k\, \mathcal{N}_c(\mathbf{F}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{l=1}^{K} \alpha_l\, \mathcal{N}_c(\mathbf{F}_n \mid \boldsymbol{\mu}_l, \boldsymbol{\Sigma}_l)}. \tag{8}$$

• M-step: update the parameters

$$\alpha_k = \frac{1}{N} \sum_{n=1}^{N} \gamma_{nk} \tag{9}$$

$$\boldsymbol{\mu}_k = \frac{\sum_{n=1}^{N} \gamma_{nk}\, \mathbf{F}_n}{\sum_{n=1}^{N} \gamma_{nk}} \tag{10}$$

$$\boldsymbol{\Sigma}_k = \frac{\sum_{n=1}^{N} \gamma_{nk}\, (\mathbf{F}_n - \boldsymbol{\mu}_k)(\mathbf{F}_n - \boldsymbol{\mu}_k)^H}{\sum_{n=1}^{N} \gamma_{nk}}. \tag{11}$$

With the GMM, $N$ $d$-dimensional feature vectors $\mathbf{F}_n$ are
characterized by $K$ sets of GMM parameters $\{\alpha_k, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\}_{k=1,\ldots,K}$,
where $K$ is often very small compared to $N$. However,
since the GMM does not explicitly model the amplitude variation
of sound sources, signals with different amplitude levels but
similar spectral shapes may result in different estimated mean
and covariance templates. To overcome this issue, another
version of the GMM, called the Gaussian scaled mixture
model (GSMM), could be considered instead. Though the GSMM
has been used in speech enhancement [30] and audio source
separation [?], it has not yet been applied in the context of
fingerprinting.

B. HMM-based fingerprint

The HMM is a well-known model in many audio processing
applications [45]. When applied to audio fingerprinting, the pdf
of the observed feature vector $\mathbf{F}_n$ can be written as

$$p(\mathbf{F}_n) = \sum_{q_1, q_2, \ldots, q_d} \pi_{q_1} b_{q_1}(F_{n,1})\, a_{q_1 q_2} b_{q_2}(F_{n,2}) \cdots a_{q_{d-1} q_d} b_{q_d}(F_{n,d}) \tag{12}$$

where $\pi_{q_i}$ denotes the probability that $q_i$ is the initial state,
$a_{q_i q_j}$ is the state transition probability, and $b_{q_i}(F_{n,i})$ is the pdf for a
given state.

Given a sequence of observations $\mathbf{F}_n$, $n = 1, \ldots, N$, extracted
from a labeled audio signal, the model parameters
$\theta = \{\pi_{q_i}, a_{q_i q_j}, b_{q_i}\}_{i,j}$ are learned via, e.g., the EM algorithm
(the detailed formulation can be found in [45]) and stored as a
fingerprint. Cano et al. modeled MFCC feature vectors by an HMM
in their AudioDNA fingerprint system [29]. In [19], an HMM-based
fingerprint was shown to achieve high compaction by
exploiting structural redundancies in music and to be robust
to distortions.

Note that when applying the GMM or HMM to fingerprint
design, a captured signal at the user side is considered to be
matched with an original signal, fingerprinted by the model
parameters $\theta$ in the database, if its corresponding feature vectors
$\mathbf{F}_n$ are most likely generated by $\theta$.

C. NMF-based fingerprint

NMF is well known as an efficient decomposition technique
that helps reduce the data dimension [46]. It has been
widely considered in audio and music processing, especially
for audio source separation [?]. When applied in the
context of audio fingerprinting, a $d \times N$ matrix of the feature
vectors $\mathbf{V} = [\mathbf{F}_1, \ldots, \mathbf{F}_N]$ is approximated by

$$\mathbf{V} \approx \mathbf{W}\mathbf{H} \tag{13}$$

where $\mathbf{W}$ and $\mathbf{H}$ are non-negative matrices of size $d \times Q$ and
$Q \times N$, respectively, modeling the spectral characteristics of
the signal and its temporal activation, and $Q$ is much smaller
than $N$. The model parameters $\theta = \{\mathbf{W}, \mathbf{H}\}$ can be estimated
by minimizing the following cost function:

$$C(\theta) = \sum_{b,n} d_{IS}\left([\mathbf{V}]_{b,n} \mid [\mathbf{W}\mathbf{H}]_{b,n}\right), \tag{14}$$

where $d_{IS}(x \mid y) = \frac{x}{y} - \log\frac{x}{y} - 1$ is the Itakura-Saito (IS)
divergence, and $[\mathbf{A}]_{b,n}$ denotes the entry of matrix $\mathbf{A}$ at the
$b$-th row and $n$-th column. The resulting multiplicative update
(MU) rules for parameter estimation are [49]:

$$\mathbf{H} \leftarrow \mathbf{H} \odot \frac{\mathbf{W}^T\left((\mathbf{W}\mathbf{H})^{.-2} \odot \mathbf{V}\right)}{\mathbf{W}^T (\mathbf{W}\mathbf{H})^{.-1}} \tag{15}$$

$$\mathbf{W} \leftarrow \mathbf{W} \odot \frac{\left((\mathbf{W}\mathbf{H})^{.-2} \odot \mathbf{V}\right)\mathbf{H}^T}{(\mathbf{W}\mathbf{H})^{.-1}\, \mathbf{H}^T} \tag{16}$$

where $\odot$ denotes the Hadamard entrywise product, $\mathbf{A}^{.p}$ is
the matrix with entries $[\mathbf{A}]_{b,n}^{p}$, and the division is entrywise.
Fingerprints are then generated compactly from the basis
matrix $\mathbf{W}$, which is much smaller than the
feature matrix $\mathbf{V}$.
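
The multiplicative updates for NMF under the Itakura-Saito divergence can be sketched in a few lines of NumPy. This is an illustrative implementation under stated assumptions: random initialization, a fixed iteration count, and a small constant `eps` to keep all entries strictly positive are choices made here, and `nmf_is` and `is_divergence` are hypothetical names.

```python
import numpy as np

def is_divergence(V, V_hat):
    """Sum over all entries of d_IS(x | y) = x/y - log(x/y) - 1."""
    ratio = V / V_hat
    return float(np.sum(ratio - np.log(ratio) - 1.0))

def nmf_is(V, Q, n_iter=100, seed=0, eps=1e-12):
    """Approximate V (d x N, strictly positive) by W (d x Q) times H (Q x N)."""
    rng = np.random.default_rng(seed)
    d, N = V.shape
    W = rng.random((d, Q)) + eps
    H = rng.random((Q, N)) + eps
    for _ in range(n_iter):
        V_hat = W @ H + eps
        # Update of H: entrywise product and entrywise division.
        H *= (W.T @ (V * V_hat ** -2)) / (W.T @ (V_hat ** -1))
        V_hat = W @ H + eps
        # Update of W, using the refreshed approximation.
        W *= ((V * V_hat ** -2) @ H.T) / ((V_hat ** -1) @ H.T)
    return W, H
```

Because the updates only multiply by non-negative ratios, `W` and `H` stay non-negative throughout; the columns of `W` would then serve as the compact fingerprint.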










NMF was applied to the spectral subband energy matrix
in [32] and to the MFCC matrix in [50]. The resulting
fingerprint was shown to identify audio clips better than another
decomposition technique, namely singular value decomposition
(SVD).
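
Returning to the model-based fingerprints of Sections IV-A and IV-B, the sketch below fits a GMM by EM and scores a query by its average log-likelihood, which is one way to implement the rule that a recording matches the model most likely to have generated its feature vectors. It makes simplifying assumptions that differ from the formulation above: features are real-valued, the real Gaussian density is used instead of the complex form, covariances are diagonal, and means are initialized from randomly chosen data points. All function names are hypothetical.

```python
import numpy as np

def log_gauss_diag(F, mu, var):
    """Log density of a real Gaussian with diagonal covariance; F is (N, d)."""
    return -0.5 * (np.sum(np.log(2 * np.pi * var))
                   + np.sum((F - mu) ** 2 / var, axis=1))

def log_joint(F, alpha, mu, var):
    """(N, K) array of log(alpha_k) + log N(F_n | mu_k, var_k)."""
    return np.stack([np.log(alpha[k]) + log_gauss_diag(F, mu[k], var[k])
                     for k in range(len(alpha))], axis=1)

def em_step(F, alpha, mu, var, floor=1e-6):
    """One EM iteration; also returns the log-likelihood before the update."""
    lj = log_joint(F, alpha, mu, var)
    log_norm = np.logaddexp.reduce(lj, axis=1)     # log p(F_n | theta)
    gamma = np.exp(lj - log_norm[:, None])         # E-step: posteriors
    Nk = gamma.sum(axis=0)                         # soft counts per state
    alpha = Nk / len(F)                            # M-step: weights
    mu = (gamma.T @ F) / Nk[:, None]               # M-step: means
    var = np.maximum((gamma.T @ F ** 2) / Nk[:, None] - mu ** 2, floor)
    return alpha, mu, var, float(log_norm.sum())

def fit_gmm(F, K, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    alpha = np.full(K, 1.0 / K)
    mu = F[rng.choice(len(F), K, replace=False)].astype(float)
    var = np.tile(F.var(axis=0) + 1e-6, (K, 1))
    logliks = []
    for _ in range(n_iter):
        alpha, mu, var, ll = em_step(F, alpha, mu, var)
        logliks.append(ll)
    return (alpha, mu, var), logliks

def average_log_likelihood(F, params):
    """Matching score: mean log p(F_n | theta) over the query frames."""
    alpha, mu, var = params
    return float(np.logaddexp.reduce(log_joint(F, alpha, mu, var), axis=1).mean())
```

In a retrieval setting, `average_log_likelihood` would be evaluated against each stored parameter set and the highest-scoring entry returned; an HMM-based system would replace the per-frame mixture likelihood with a sequence likelihood.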

V. Conclusion 

In this paper, we presented a review of the existing audio
fingerprinting systems that have been developed by numerous
researchers during the last decade for a range of practical
applications. We described a variety of audio features and
reviewed state-of-the-art approaches exploiting them for
fingerprint design. Furthermore, the use of statistical models
and decomposition techniques to reduce the global statistical
redundancy of feature vectors, and therefore to decrease the
fingerprint size, was also summarized. As a result, combining
the different features presented and/or subsequently deploying a
statistical feature model are both applicable ways to obtain
a robust and compact audio signature.